Active Data: A Programming Model for Managing Big Data Life Cycle

Authors

  • Anthony Simonet
  • Gilles Fedak
  • Matei Ripeanu
Abstract

The Big Data challenge consists in managing, storing, analyzing and visualizing ever-growing datasets to extract sense and knowledge. As the volume of data grows exponentially, the management of these data becomes proportionally more complex. A key point is to handle the complexity of the data life cycle, i.e. the various operations performed on data: transfer, archiving, replication, deletion, etc. To alleviate the complexity of the data life cycle, we propose Active Data, a programming model to automate and improve the expressiveness of data management applications. We first introduce the concept of data life cycle and define a formal model based on Petri nets. We then present the Active Data programming model, which allows code execution at each stage of the data life cycle. With Active Data, routines provided by programmers are executed when a set of events (creation, replication, transfer, deletion) happens to any data. We implement and evaluate the model with three use cases: a storage cache for Amazon S3, a cooperative sensor network, and an incremental implementation of the MapReduce programming model. Altogether, these scenarios illustrate the adequacy of the model for programming applications that manage distributed and dynamic data. We also show that applications that do not leverage the data life cycle can benefit from Active Data to improve their performance.

Key-words: programming model, distributed storage system, file system

∗ INRIA, University of Lyon
† University of British Columbia, Canada

French title: Active Data : un modèle de programmation pour la gestion des cycles de vie de Big Data
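The idea described in the abstract, a life cycle modeled as a Petri net with programmer-supplied routines run when events occur, can be illustrated with a small sketch. This is not the Active Data API: the class `LifeCycle`, its methods `subscribe` and `fire`, the place names, and the transition table are all hypothetical and chosen only for illustration. It shows a token moving between places as transitions fire, and handlers being called for the transitions they subscribed to.

```python
from collections import defaultdict


class LifeCycle:
    """Toy Petri net (hypothetical): places hold tokens, transitions move them."""

    def __init__(self, transitions, initial_place):
        # transitions maps a name to (input place, list of output places)
        self.transitions = transitions
        self.marking = defaultdict(int)
        self.marking[initial_place] = 1
        self.handlers = defaultdict(list)

    def subscribe(self, transition, handler):
        """Register a routine to run whenever `transition` fires."""
        self.handlers[transition].append(handler)

    def fire(self, transition, data_id):
        """Fire a transition if it is enabled, then run its handlers."""
        source, targets = self.transitions[transition]
        if self.marking[source] < 1:
            raise ValueError(f"{transition!r} is not enabled for {data_id}")
        self.marking[source] -= 1
        for place in targets:
            self.marking[place] += 1
        for handler in self.handlers[transition]:
            handler(data_id, transition)


# Assumed life cycle: a created item can be replicated (the original stays
# stored and a replica appears), transferred, or deleted.
TRANSITIONS = {
    "create": ("new", ["stored"]),
    "replicate": ("stored", ["stored", "replica"]),
    "transfer": ("stored", ["stored"]),
    "delete": ("stored", ["deleted"]),
}

log = []
cycle = LifeCycle(TRANSITIONS, initial_place="new")
cycle.subscribe("replicate", lambda d, t: log.append(f"{d}: new replica"))
cycle.subscribe("delete", lambda d, t: log.append(f"{d}: cleaned up"))

# Replay a sequence of life cycle events for one data item.
for event in ("create", "replicate", "delete"):
    cycle.fire(event, "dataset-1")

print(log)
```

In this sketch the handlers only append to a list; in a real data management application they might trigger a cache eviction or launch a computation. Firing a transition whose input place holds no token (for example `transfer` after `delete`) raises an error, which is how the model rejects events that are invalid for the current state.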

Related resources

Cloud Computing Technology Algorithms Capabilities in Managing and Processing Big Data in Business Organizations: MapReduce, Hadoop, Parallel Programming

The objective of this study is to verify the importance of cloud computing service capabilities in managing and analyzing big data in business organizations. The rapid development in the use of information technology in general, and network technology in particular, has led many organizations to make their applications available for use via electronic platforms hos...

Active Data: A programming model to manage data life cycle across heterogeneous systems and infrastructures

The Big Data challenge consists in managing, storing, analyzing and visualizing these huge and ever growing data sets to extract sense and knowledge. As the volume of data grows exponentially, the management of these data becomes more complex in proportion. A key point is to handle the complexity of the data life cycle, i.e. the various operations performed on data: transfer, archiving, replica...

Managing Environmentally Conscious in Designing Closed-loop Supply Chain for the Paper Industry

Large amounts of waste paper are disposed of every year in Iran instead of being recovered, posing health hazards and causing environmental damage. Collecting, recovering and properly disposing of waste paper without damaging the environment requires an efficient closed-loop supply chain network. The main objective of this paper is to introduce a bi-objective, multi-echelon, multi-product and single-...

Effect of family structure on urban areas modal split by using the life cycle concept

The modal split model is one of the steps of classical four-step travel demand planning. Predictive, descriptive, and prescriptive modal split models are essential to strike a balance between travel demand and supply. To calibrate these models, it is necessary to detect and employ influential independent variables that are related to characteristics of travel modes, individual and family attr...

A new solving approach for fuzzy multi-objective programming problem in uncertainty conditions by using semi-infinite linear programing

In practice, there are many problems in which the decision parameters are fuzzy numbers, and some of these problems are formulated using either possibilistic programming or multi-objective programming methods. In this paper, we consider a multi-objective programming problem with fuzzy data in constraints and introduce a new approach for solving these problems based on a combination of the multi-objecti...

Publication date: 2012